Download the template notebook here.
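If you don’t know the shape of your data, you might draw inappropriate conclusions from the statistical tests you perform.
Check out the Datasaurus Dozen, a set of datasets that look wildly different yet share nearly identical x/y means and standard deviations. It is a modern take on Anscombe’s quartet (not Simpson’s paradox, which is about a trend reversing when groups are combined).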
Dataviz is also engaging, communicative, creative, efficient, and fun.
There are some ways to create plots quickly, and it’s good to know a little about that before we get too far.
# plot()
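As a quick sketch of what plot() does by default: the data used in the template notebook isn’t shown here, so this assumes the penguins data from the palmerpenguins package (the column names below come from that package).

```r
# Assumption: penguins comes from the palmerpenguins package
library(palmerpenguins)

# two numeric vectors: plot() guesses that we want a scatter plot
plot(penguins$bill_length_mm, penguins$body_mass_g)

# a whole data frame: plot() guesses a matrix of pairwise scatter plots
plot(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm")])
```

The same function gives very different output depending on what it is handed, which is handy for a first look and less handy when it guesses wrong.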
We will talk more about what to do with these plots later, but I want to show you a Quantile-Quantile plot (QQ plot) output so you can see how easy it is to make even before we discuss why you’d want to make it.
#qqplot()
#qqnorm()
#qqline()
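Here is a minimal sketch of the three QQ functions, using simulated values rather than the notebook’s data (the seed and the variable names sample_a and sample_b are made up for illustration).

```r
set.seed(1)                    # hypothetical seed, only for reproducibility
sample_a <- rnorm(100)         # simulated values from a normal distribution
sample_b <- rexp(100)          # simulated right-skewed values

qqnorm(sample_a)               # sample quantiles against theoretical normal quantiles
qqline(sample_a, col = "red")  # reference line through the first and third quartiles

qqplot(sample_a, sample_b)     # quantiles of one sample against those of another
```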
Using the tidyverse, we can create beautiful, engaging, informative plots. To do so, we will build up the plot in layers. This might seem to be a bit of a faff at first, but it ends up being powerful (and easy once you know how the components work).
# base plot
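A sketch of the first layer, assuming penguins comes from the palmerpenguins package (p_base is just an illustrative name):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# data only: nothing is mapped yet, so ggplot() draws an empty panel
p_base <- ggplot(data = penguins)
p_base
```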
This is the background of the plot – a check to see that penguins is something that can be plotted from.
To start adding things to the plot, we need to specify what we want the plot to extract from penguins.
# add aesthetics
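A sketch of the aesthetics step. Which columns the original notebook maps isn’t shown here, so bill length and body mass are assumptions, as is the palmerpenguins package.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# aes() maps columns to the x and y axes; the axes appear, but no data is drawn yet
p_aes <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g))
p_aes
```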
Now the plot knows a bit more about what we’re asking, but not enough to show the data. This is how ggplot() differs from just plot(). By itself, plot() infers how we want to display our data. This is great when it’s correct and not great when it’s wrong (which it often is, without additional specifications). In contrast, ggplot() requires the specifications from the start, but they’re integrated more smoothly.
# simple scatter plot
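A sketch of the scatter plot, again assuming the palmerpenguins data and the bill length and body mass columns:

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# geom_point() is the layer that actually draws the data
p_scatter <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point()
p_scatter
```

With these columns, ggplot() warns that it removed rows with missing values, because two penguins have no measurements recorded.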
You can get rid of the warning message at the top (which we don’t care about) by adding warning = FALSE to the chunk options, after the code chunk identifier:
{r, eval=FALSE, warning=FALSE}
You can do a lot of other neat stuff here, but you’ll have to look that up on your own time.
Up to this point, we’ve done exactly what the simple plot() command can do. Now, we want to go beyond.
# colour species differently
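A sketch of colouring by species, under the same assumptions (palmerpenguins data, bill length against body mass):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# mapping colour to species inside aes() gives each species its own colour and a legend
p_species <- ggplot(penguins, aes(x = bill_length_mm,
                                  y = body_mass_g,
                                  colour = species)) +
  geom_point()
p_species
```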
To replicate this plot in base R with plot(), we’d need to manually subset and plot each species separately, which is a pain, doesn’t include all the same options, and yet takes a LOT more code (plus it still doesn’t look as nice, in my opinion):
# plot() is part of base R
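For comparison, here is one way the base R version might look. This is a sketch, not the notebook’s original code: the colours and variable names are made up, and the data are assumed to come from palmerpenguins.

```r
library(palmerpenguins)  # assumption: source of the penguins data

# draw an empty frame sized to all the data, then add each species by hand
plot(penguins$bill_length_mm, penguins$body_mass_g, type = "n",
     xlab = "bill length (mm)", ylab = "body mass (g)")

species_names <- levels(penguins$species)
species_cols <- c("darkorange", "purple", "cyan4")  # hypothetical colours

for (i in seq_along(species_names)) {
  one_species <- penguins[penguins$species == species_names[i], ]
  points(one_species$bill_length_mm, one_species$body_mass_g,
         col = species_cols[i], pch = 19)
}

# the legend has to be built manually as well
legend("topleft", legend = species_names, col = species_cols, pch = 19)
```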
So from now on, we’ll be working with ggplot() only.
One more thing: it’s important to make your plots visually accessible to a broad audience. The default ggplot() has a grey background, which ends up causing problems more often than a white background would, so we’ll also always add the theme_bw() option to our plots from here out. It’s optional but recommended.
Let’s take a look at what’s needed to make a histogram or density plot.
# think about when it's appropriate to use histograms vs density plots
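A sketch of both plot types, assuming the palmerpenguins data; the choice of flipper length and the bin width are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# histogram: counts of observations in bins of a fixed width
p_hist <- ggplot(penguins, aes(x = flipper_length_mm)) +
  theme_bw() +
  geom_histogram(binwidth = 5)

# density: a smoothed estimate of the distribution, convenient for comparing groups
p_dens <- ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  theme_bw() +
  geom_density(alpha = 0.5)

p_hist
p_dens
```

Both need only an x aesthetic, because the y values are computed from the data.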
Another common and useful type of plot is the box and whisker plot.
# box plots are easy and demonstrate how crossed designs are easy to plot
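A sketch of a box plot for a crossed design, assuming the palmerpenguins data. Crossing species with sex is my choice of example, and penguins_sexed is a made-up name.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# drop penguins whose sex was not recorded so the plot has no NA group
penguins_sexed <- subset(penguins, !is.na(sex))

# one factor on the x-axis, the other as fill: each species gets one box per sex
p_box <- ggplot(penguins_sexed, aes(x = species, y = body_mass_g, fill = sex)) +
  theme_bw() +
  geom_boxplot()
p_box
```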
Bar plots are sometimes controversial, but they can also be very useful. They take slightly different arguments than other types of plots because of how the bar height is ‘calculated’.
# bar plots can require some extra prep work
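A sketch of the two common cases, assuming the palmerpenguins data (mean_mass is a made-up name). geom_bar() counts rows for you, whereas geom_col() expects bar heights you have computed in advance.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# bar height = number of rows per species, counted by geom_bar()
p_count <- ggplot(penguins, aes(x = species)) +
  theme_bw() +
  geom_bar()

# prep work: compute the mean body mass per species first (rows with NA are dropped)
mean_mass <- aggregate(body_mass_g ~ species, data = penguins, FUN = mean)

# bar height = the pre-computed mean, drawn as-is by geom_col()
p_mean <- ggplot(mean_mass, aes(x = species, y = body_mass_g)) +
  theme_bw() +
  geom_col()

p_count
p_mean
```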
You may also want to explore fancier types of plots, or combine types we’ve already encountered. This is easy with ggplot()’s modular construction and visual grammar.
# overlay violin, box, and points
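A sketch of layering three geoms on one plot, assuming the palmerpenguins data; the widths and transparency are illustrative values.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# layers are drawn in the order they are added: violin, then box, then points
p_overlay <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  theme_bw() +
  geom_violin() +
  geom_boxplot(width = 0.2) +
  geom_jitter(width = 0.1, alpha = 0.3)
p_overlay
```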
Statistics is (like) dangerous dark magic: if you know what you’re doing, it means you’ve dedicated your life (soul) to it and have no time or capacity to do other things. If you don’t know what you’re doing, it can hurt you or people around you. If you’re somewhere in the middle, it is best to go slow, hedge your bets, and use it judiciously.
Why is statistics something to be wary of?
Let’s create our own toy dataset:
set.seed(18) # 15 16
x = rnorm(50) # 50 random numbers from a normal distribution
y = 2 * x + 5 # for each x, multiply by 2 (slope) and add 5 (intercept)
Here is what these data look like. They are too perfect: everything falls on a straight line, even though x is random:
# create a table with two columns
# establish the base of a plot
# use points to plot the data
# use a nice theme
# add a red line with the specified slope
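The steps in these comments might look like the sketch below. The table name tbl0 is made up; the seed, slope, and intercept are the ones defined above.

```r
library(tibble)
library(ggplot2)

set.seed(18)
x <- rnorm(50)          # 50 random numbers from a normal distribution
y <- 2 * x + 5          # slope = 2, intercept = 5, no noise

tbl0 <- tibble(x, y)    # create a table with two columns (hypothetical name)

p_perfect <- ggplot(tbl0, aes(x = x, y = y)) +                   # base of the plot
  geom_point() +                                                 # data as points
  theme_bw() +                                                   # a nice theme
  geom_abline(slope = 2, intercept = 5, colour = "red")          # the specified line
p_perfect
```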
Let’s add more noise, like any complex system would have:
e = rnorm(50) # random noise
# realistic model
y2 = 2 * x + 5 + e # slope = 2, intercept = 5, random noise ("error" or epsilon) = e
tbl1 <- tibble(x, y2) # combine into a dataset
How does the new noise change the data?
# using this dataset
# create a base plot with x on the x-axis and y2 on the y axis
# make it pretty
# make it a scatter plot (points)
# add the red line to indicate intended slope and intercept
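A sketch of the noisy version. It recreates x, e, and y2 exactly as defined above so that it runs on its own.

```r
library(tibble)
library(ggplot2)

set.seed(18)
x <- rnorm(50)              # same x as before
e <- rnorm(50)              # random noise
y2 <- 2 * x + 5 + e         # slope = 2, intercept = 5, plus noise
tbl1 <- tibble(x, y2)

p_noisy <- ggplot(tbl1, aes(x = x, y = y2)) +                    # x on the x-axis, y2 on the y-axis
  theme_bw() +                                                   # make it pretty
  geom_point() +                                                 # scatter plot
  geom_abline(slope = 2, intercept = 5, colour = "red")          # intended slope and intercept
p_noisy
```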
Is the red line still the best way to approximate this data?
model1 <- lm(y2 ~ x) # calculate slope and intercept automatically
What are the calculated slope and intercept?
coef(model1) # `coef` stands for coefficients
Plot the data with the intended shape of the data (red) and the calculated shape (green):
# using the toy dataset
# establish the base of the plot
# make it pretty
# draw the data as points
# add a vertical blue line at the "intercept" (y axis)
# add the intended shape of the data (red)
# add the calculated shape (green)
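A sketch of this comparison plot, recreating the toy data and model so it runs on its own. Reading the vertical blue line as x = 0 (where a line’s height equals its intercept) is my assumption.

```r
library(tibble)
library(ggplot2)

set.seed(18)
x <- rnorm(50)
e <- rnorm(50)
y2 <- 2 * x + 5 + e
tbl1 <- tibble(x, y2)
model1 <- lm(y2 ~ x)        # estimated intercept and slope

p_compare <- ggplot(tbl1, aes(x = x, y = y2)) +
  theme_bw() +
  geom_point() +
  geom_vline(xintercept = 0, colour = "blue") +                  # where lines cross the y axis
  geom_abline(intercept = 5, slope = 2, colour = "red") +        # intended shape
  geom_abline(intercept = coef(model1)[[1]],
              slope = coef(model1)[[2]],
              colour = "green")                                  # calculated shape
p_compare
```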
How do the intended shape and calculated shape differ? Why?
We can actually extract these numbers (and more!) in a fancy-looking output summary:
summary(model1)
Moreover, there’s actually a way to do this calculation within a plot:
# geom_smooth() calculates the simple linear regression within the ggplot() environment
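A sketch of letting ggplot() fit the line, using the same toy data (recreated here so the block runs on its own). Adding the red intended line for comparison is my choice.

```r
library(tibble)
library(ggplot2)

set.seed(18)
x <- rnorm(50)
e <- rnorm(50)
y2 <- 2 * x + 5 + e
tbl1 <- tibble(x, y2)

p_smooth <- ggplot(tbl1, aes(x = x, y = y2)) +
  theme_bw() +
  geom_point() +
  geom_abline(intercept = 5, slope = 2, colour = "red") +        # intended shape
  geom_smooth(method = "lm", formula = y ~ x)                    # fitted line with confidence band
p_smooth
```

With method = "lm", geom_smooth() fits the same kind of linear model as lm(y2 ~ x), and the grey band is a 95% confidence interval by default.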
simdat <- read.csv("../data/simulated-data.csv", header = TRUE)
Using the simulated data set from last time, make and interpret the following plots:
age with the conditions created by crossing the Frequency factor (freq) with the Grammaticality factor (gram).
age is numeric and (ostensibly) continuous, try using geom_smooth(), among other options.
colour and linetype for the two factors if you wish. fill is also available.
geom_boxplot()) to plot each of the five regions’ reaction times, illustrating the four conditions.
facet_grid() or facet_wrap(). Look them up and learn how to use them.
fill aesthetic, for example.
ggplot().
interaction() or you may want to mutate() a column in advance.
Go through the following code line by line. Using an internet search engine, the R documentation Help window, and selectively changing or commenting out code, identify what each line does. Take notes by adding comments (# like this) after each line.
simdat %>%
  mutate(region = as.factor(region)) %>%
  group_by(freq, gram, region) %>%
  summarise(mean.rt = mean(rt),
            se.rt = sd(rt)/sqrt(n())) %>%
  ggplot(aes(x = region,
             y = mean.rt,
             group = interaction(freq, gram),
             colour = gram,
             linetype = freq)) +
  theme_bw() +
  geom_point() +
  geom_path() +
  geom_errorbar(aes(ymin = mean.rt - se.rt,
                    ymax = mean.rt + se.rt),
                width = .2) +
  scale_x_discrete(labels = c("the", "old", "VERB", "the", "boat")) +
  scale_color_manual(values = c("grey20", "grey60")) +
  ylab("reaction time (ms)") +
  xlab("region of interest") +
  ggtitle("Self-paced reading time across all regions",
          subtitle = "Shaded areas indicate significant main effects") +
  annotate(geom = "rect",
           xmin = 2.6,
           xmax = 3.4,
           ymin = 350,
           ymax = 430,
           alpha = .2) +
  annotate(geom = "rect",
           xmin = 4.6,
           xmax = 5.4,
           ymin = 390,
           ymax = 470,
           alpha = .2) +
  annotate(geom = "text",
           x = 3,
           y = 427,
           label = "*",
           size = 10) +
  annotate(geom = "text",
           x = 5,
           y = 467,
           label = "*",
           size = 10) +
  NULL